INTRODUCTION


This study was undertaken to com-
pare the forced choice technique
with other methods for evaluating the
performance of commissioned profes-
sional health personnel in the United
States Public Health Service. It is a part 
of the officer selection and evaluation 
program described by Newman (7). 


The Public Health Service, the major Federal 
organization responsible for the health of the 
nation, employs approximately 16,000 Civil Serv- 
ice personnel and 3,000 commissioned officers. 
The commissioned officer component of the 
Service is composed of carefully selected per- 
sonnel in various scientific specialties and in 
the health professions of medicine, dentistry, 
nursing, sanitary engineering, pharmacy, veter- 
inary medicine, dietetics, and physical therapy. 
These officers hold clinical, research, public 
health, and administrative positions in Service 
hospitals, outpatient clinics, research centers, 
regional offices, other governmental agencies, and 
foreign countries. 

The performance evaluation of commissioned 
officers is accomplished through periodic effici- 
ency reporting. In an effort to improve the 
Service’s performance-rating system, an investi- 
gation of methods for evaluating job perform- 
ance was undertaken in 1949. A review of the 
literature in the performance-rating field revealed 
that no data had been reported then, or at 
present, on the kinds of highly trained scientific 
and professional personnel employed in the 
varied and specialized work areas in the Service. 


A consideration of various perform-
ance evaluation methods led to the con-
clusion that the forced choice technique
developed by the Department of the
Army appeared most promising for use
in the Public Health Service setting.
While investigations of the forced choice
technique have been based on popula-
tions which are quite different from
Public Health Service professional per-
sonnel, reports by Sisson (10), Witsell
(12), and the Adjutant General’s Office,
Department of the Army (13, 14) seemed
to indicate the usefulness of the tech-
nique in a commissioned personnel
system.

*Formerly with the United States Public
Health Service.

An Experimental Efficiency Report, in- 
corporating forced choice items and 
other evaluation materials not in use in 
the Service in 1949, was designed and 
distributed to the supervisors of active- 
duty officers. The effectiveness of the 
forced choice technique is here com- 
pared with that of the more conventional 
evaluation methods included in the Ex- 
perimental Report. It is anticipated that 
the results of this study will contribute 
to the literature on performance evalua- 
tion and, in particular, to that of forced 
choice methodology. The findings may 
also be of special interest to local or state 
health departments, hospitals, research 
institutions, or other organizations em- 
ploying professional personnel engaged 
in public health work, medical care, or 
research relevant to the problems of 
health and disease. In addition, this
study calls attention to some of the prob- 
lems implicit in the evaluation of rela- 
tively small numbers of employees en- 
gaged in a variety of specialized profes- 
sional activities. 


PROBLEMS 


An investigation of performance-evalu- 
ation methods was undertaken to answer 
the following questions: 

1. Is the forced choice technique an 
effective method for measuring the per- 
formance of professional health person- 
nel in the Public Health Service? 

2. How does the forced choice tech- 
nique compare in validity and reliability 
with more conventional methods of per- 
formance evaluation? 

3. How do factors such as the charac- 
teristic evaluated in criterion ratings, the 
administrative level of the supervisor 
completing efficiency reports, and the 
grade of the ratee affect the validity of 
efficiency reporting? 

4. What combination of efficiency- 
reporting methods optimally predicts 
the performance of personnel in the vari- 
ous professional and occupational fields 
of the Service? 

In addition to the major problems of 
the study, it was possible to compare the 
validity of the Experimental Report with 
that of the Officer’s Progress Report (the 
efficiency report in use in the Service in 
1949) which had been completed under 
operational rather than experimental 
conditions.


MATERIALS


Criteria


The criteria of performance within
the Public Health Service consisted of 
20-point graphic rating scales used for 
the evaluation of each of the following 
factors: Work Performance, Administra-
tive Ability, Personality (Personal Quali-
fications), and Over-all Value to the 
Service. Instructions for each scale re- 
quested raters to compare the ratee with 
a typical group of personnel having simi- 
lar duties and responsibilities. A rating 
of 1 was used to designate the least ef- 
fective ratees, and a rating of 20 was used 
to designate the most effective. 


Experimental Efficiency Report 


This Report was divided into the 
following four sections, samples of which 
may be seen in the Appendix. 

Section I—Forced Choice. This part 
consisted of 50 tetrads adapted from 
items developed by the Department of 
the Army (10). Each tetrad was com- 
posed of four words or phrases descrip- 
tive of job performance or personal 
qualifications from which a supervisor 
was to select (a) the one most descriptive 
and (b) the one least descriptive of the 
individual he was rating. A preliminary 
investigation had shown that the Army 
tetrads adapted for use with Public 
Health Service personnel produced a 
promising number of scorable alterna- 
tives (9). 

Section II—Job Proficiency. This was 
a list of ten major work areas in the 
Public Health Service from which a 
supervisor was to indicate the ratee’s pri- 
mary job function. The supervisor was 
then requested to rate, on a ten-point 
scale, the quality of the ratee’s perform- 
ance in this function. 

Section III—Personal Qualifications.
This section consisted of ten-point rating
scales for the evaluation of eight person-
ality characteristics such as reaction to
criticism, freedom from bias and emo-
tional upset, ability to work with others,
ability to act on own responsibility, and
diligence and persistence in performing
necessary work.

* Appreciation is expressed to the Personnel
Research Branch, The Adjutant General’s Office,
Department of the Army, for making these ma-
terials available.

Section IV—Check List. This part con-
sisted of 22 statements which were to
be marked as applying or not applying 
to the ratee. The statements concerned 
professional knowledge, interest in work, 
planning and organizing ability, leader- 
ship, versatility, and other characteristics 
related to work performance. In the de- 
velopment of the Check List, some 500 
statements had been extracted verbatim 
from “Remarks” sections of the Officer’s 
Progress Report, placed in 12 logical 
categories, and sorted according to Thur- 
stone’s variation of the method of equal- 
appearing intervals (8, 11). The 22 
statements comprising the final check list 
were those which showed the least varia- 
bility (smallest semi-interquartile range) 
and which were deemed most relevant to 
performance. 
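
The selection rule just described, retaining the statements with the smallest semi-interquartile range, can be illustrated with a short Python sketch. The function name, the 11-interval scale, and the judges' placements below are hypothetical and serve only to show the arithmetic; the source does not report the number of intervals or judges.

```python
from statistics import median, quantiles


def scale_value_and_q(judgments):
    """Return (scale value, semi-interquartile range) for one statement.

    The scale value is taken as the median of the judges' interval
    placements; Q is half the distance between the quartiles.
    """
    q1, _, q3 = quantiles(judgments, n=4, method="inclusive")
    return median(judgments), (q3 - q1) / 2


# Hypothetical placements of two statements by nine judges
# on an assumed 11-interval scale.
placements = {
    "statement A": [8, 9, 9, 9, 9, 9, 9, 10, 10],
    "statement B": [3, 5, 6, 8, 9, 9, 10, 11, 11],
}

results = {name: scale_value_and_q(js) for name, js in placements.items()}

# The statement on which the judges agree most closely has the smallest Q.
least_variable = min(results, key=lambda name: results[name][1])
```

Both hypothetical statements have the same scale value, but only the first would be retained under a smallest-Q rule, since the judges disagree widely about the second.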


Officer’s Progress Report 


Samples of parts of the Progress Re- 
port used in 1949 to obtain periodic ef- 
ficiency ratings on commissioned person- 
nel are shown in the Appendix. The 
Progress Report contains two types of 
evaluations: 

Rating Scales. This section consists 
of 11 five-point rating scales for evaluat- 
ing such factors as judgment, general 
professional knowledge, proficiency in 
assigned duties, industry, tact, initiative, 
and dependability. The scales are scored 
by a point system, odd number values of 
one through nine being assigned to the 
five points on each scale; values from all 
scales are averaged to obtain a total score. 
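
As an illustration only, the point system for the Rating Scales can be sketched in Python. The function name and the coding of the five scale positions as 1 through 5 are assumptions made for this sketch; the Report itself specifies only that odd values of one through nine are assigned and then averaged.

```python
def rating_scales_score(positions):
    """Average the odd point values (1, 3, 5, 7, 9) of the marked positions.

    `positions` holds one integer from 1 to 5 per rating scale
    (an assumed coding of the five points on each scale).
    """
    if any(p not in (1, 2, 3, 4, 5) for p in positions):
        raise ValueError("each scale position must be an integer from 1 to 5")
    values = [2 * p - 1 for p in positions]  # 1->1, 2->3, 3->5, 4->7, 5->9
    return sum(values) / len(values)


# Hypothetical marks on the 11 scales of one Progress Report.
example = [1, 2, 3, 4, 5, 3, 3, 3, 3, 3, 3]
total = rating_scales_score(example)
```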

Narrative Comments. Several ques- 
tions in the Progress Report also elicit 
narrative comments concerning a ratee’s 
performance. A method for scoring these
comments was developed (8). Its use
involves assigning to a comment in a 
Progress Report the scale value of a 
matching comment in a scoring manual. 
The total score for the Narrative Com- 
ments is the average of the scale values 
of all comments in a Report. 

Total raw scores from the Narrative 
Comments and the Rating Scales are 
separately converted to standard scores 
on the basis of norms established for 
each officer grade and profession.* 


COLLECTION OF DATA 


Criterion ratings were obtained dur-
ing 1949 from 45 of the 54 Public Health
Service installations in the United States, 
including 14 hospitals, 10 regional of- 
fices, 8 divisions, 8 laboratories of the 
National Institutes of Health, and 5 
other installations such as outpatient 
clinics. Nine stations were excluded from 
the study because of practical considera- 
tions and the small numbers of possible 
ratees at most of these stations. 


Ratings were obtained in a systematic manner 
by a staff representative who explained the 
method of rating and administered the forms. 
Officers who worked together, regardless of pro- 
fession or grade, met in groups and rated each 
other. For each of the rating factors, which 
were randomly alternated, each officer was pro- 
vided with a roster of all officers at his station. 
He was then asked to rate each officer, excluding 
himself and officers he did not know, by plac- 
ing the ratee at one of the points on the twenty- 
point scale. Ratings were performed anony- 
mously with the assurance that they were to be 
used for research purposes only. The lower
grades of officers (the equivalents of the Navy
ensign through full lieutenant) were rated on
one scale and the higher grades on another in an
effort to reduce irrelevant grade-associated fac-
tors which might affect the ratings.

*Public Health Service officer grades and
their Navy equivalents are given below:

Public Health Service    Navy
Junior assistant         Ensign
Assistant                Lieutenant (j.g.)
Senior assistant         Lieutenant
Full                     Lieutenant commander
Senior                   Commander
Director                 Captain

One month after the collection of criterion 
data, copies of the Experimental Efficiency Re- 
port, directions for completing, and a schedule 
for designating supervisors to mark the Experi- 
mental Reports were mailed to the officers in 
charge of the installations which had been 
visited for purposes of collecting criterion ratings. 
Two independent Reports were requested on 
each ratee, one from his immediate officer su- 
pervisor and another from either the officer in 
charge or his representative. 

The Officer’s Progress Report is requested 
annually for all officers at the Full grade (Navy 
lieutenant commander) and above, and semi- 
annually for all officers below this grade. For 
purposes of the present study, one Progress 
Report was selected, where possible, for each 
officer on whom an Experimental Report had 
been completed. The Progress Reports selected 
were those completed within six months before 
or after the Experimental Reports. In most 
instances, this time control resulted in a match- 
ing of the two Reports on ratee’s Service pro- 
fession, grade, corps, and station. Progress Re-
ports which did not match the Experimental
Reports on these factors were eliminated from 
the study. 


DEVELOPMENT OF EXPERIMENTAL 
REPORT SCORING KEYS


Designation of Occupational Groups for 
Item Analysis 


The diversity of professions and job 
functions within the Public Health Serv- 
ice necessitated the designation of sep- 
arate occupational groups for which Ex- 
perimental Report scoring keys could be 
developed. An early analysis of the Ex- 
perimental Report showed that items 
scorable for various ratee professions did 
not appreciably overlap (9). In view of 
these considerations, the following fac- 
tors were controlled in designating 
groups for item-analysis purposes. 

Criteria. Intercorrelations among 
scores on the four criteria (not shown in 
tabular form) revealed that Work Per- 
formance and Personality produced the 
lowest intercorrelations, ranging from 
.50 to .72; all other correlations were
higher, ranging from .74 to .92. The de-
cision was made to use only the two
more independent criteria, Work Per-
formance and Personality, for purposes
of item analyzing the Experimental Re-
port.

NEWMAN, M. A. HOWELL, AND F. J. HARRIS

Station. This factor was controlled by 
type of station. Stations were classified 
into three groups according to their ma- 
jor functions: (a) medical care, furnished 
in installations such as hospitals and out- 
patient clinics; (b) public health work, 
carried on in regional offices and in such 
divisions as those of the Bureau of 
State Services; and (c) research, per- 
formed in installations such as the Na- 
tional Institutes of Health. 

Profession of rater. Among medical 
care personnel (in hospitals and outpa- 
tient clinics), where the number of cri- 
terion ratings permitted, profession of 
rater was controlled by using only those 
ratings performed by members of the 
ratee’s profession. For nurses working 
in medical care, two groups of raters were 
used: physicians and nurses. In the pub-
lic health and the research groups, estab-
lished by the station control, ratings by 
all professions of raters were used be- 
cause of the small numbers of personnel 
in any one profession. Further, public 
health and research personnel are usu- 
ally given efficiency ratings by supervisors 
working in the same functional area, but 
not in the same profession. Medical care 
personnel, however, not only receive 
efficiency ratings from supervisors in the 
medical care field, but frequently from 
those in the particular profession of the 
ratee. 

Profession of ratee. For purposes of
item analysis, this variable was controlled
among medical care personnel by the
establishment of separate groups accord-
ing to the three major professions rep-
resented—medicine, dentistry, and nurs-
ing. The groups of public health and
research personnel were not broken down
by profession of ratee for the same rea-
sons as those specified in the discussion
of profession of rater.

Grade of ratee. Grade was controlled
by the proportionate representation of
each ratee grade in high, middle, and
low criterion groups identified within
each of the separate item-analysis groups.
The criterion groups were the upper 27
per cent, the middle 46 per cent, and the
lower 27 per cent of ratees, determined
by the average rating they received from
the appropriate group of raters on the
separate criteria of Work Performance
and Personality.

Within the occupational field of medi-
cal care, then, four groups were estab-
lished for purposes of item analysis: (a)
physicians rated by physicians; (b) nurses
rated by nurses; (c) nurses rated by
physicians; and (d) dentists rated by
dentists and physicians. Two other item-
analysis groups based on occupational
fields were also established: (a) public
health personnel rated by public health
personnel; and (b) research personnel
rated by research personnel.

A criterion score, the average of five
or more ratings, was computed for each
ratee on each criterion. A minimum of
five ratings was required to obtain as
highly reliable scores as possible without
excluding from the study a large number
of the ratees. Where numbers permitted,
the groups designated for item analysis
were split into matched samples to pro-
vide for cross validation. All groups ex-
cept the dentists and the nurses rated
by physicians contained enough ratees
to furnish split samples.

The number in each item-analysis
group, as well as the mean and the stand-
ard deviation of the criterion scores on
each rating factor, is shown in Table 1.
From Table 1, it is important to note
that none of the mean differences in
matched groups is significant (five per
cent level or below).


TABLE 1

MEANS AND STANDARD DEVIATIONS OF CRITERION SCORES IN EACH OCCUPATIONAL
GROUP DESIGNATED FOR ITEM ANALYSIS

Columns: Occupational Group; Work Performance Criterion (No. Ratees, Mᵇ, SD);
Personality Criterion (No. Ratees,ᶜ Mᵇ, SD).

[Only three entries per row are legible in the source; they are given in
source order. The remaining entries are not recoverable.]

Occupational Group
Physicians (1)ᵃ                 158   12.42   .88
Physicians (2)                  161   12.54   .60
Public health personnel (1)      90   13.23
Public health personnel (2)      88   13.60   .25
Research personnel (1)           66   13.88   .33
Research personnel (2)           65   13.79   .08
Nurses rated by nurses (1)       56   13.28   .04
Nurses rated by nurses (2)       55   13.42   .80
Nurses rated by physicians       92   12.44   .22
Dentists                         60   13.29   .10

ᵃ (1) = sample 1; (2) = sample 2.
ᵇ Mean scores from matched samples did not differ significantly at the .05 level or below.
ᶜ In some groups more ratees were evaluated on the Personality than on the Work
Performance criterion, perhaps because raters felt they had greater opportunity
to observe personal qualifications than job performance.


Item Analysis


For purposes of item analysis, one
Experimental Report was selected for
each ratee. The one Report selected,
termed the “primary” Report, was gen-
erally completed by a supervisor whose
administrative levei corresponded to that 
of a branch chief in a division or a clini- 
cal director in a hospital. A “secondary” 
Report was available on most ratees; this 
Report was generally completed by a 
supervisor, such as the officer in charge, 
who was at a higher organizational level 
than the supervisor completing the pri- 
mary Report. Secondary Reports, al- 
though not used in item analysis, were 
scored to provide additional validation 
data. 

Section I—Forced Choice. In a previ- 
ous study, it was found that of four 
methods for item analyzing forced choice 
tetrads, the critical-ratio technique ap- 
peared the most useful in that it was 
relatively easy to apply, gave readily 
interpretable results, and yielded item 
weights which were in close agreement 
with weights derived from the other 
methods studied (9). In the present work,
critical ratios were used to item analyze 
the Forced Choice section against the 
separate criteria of Work Performance 
and Personality. Within each item-analy- 
sis group, the significance of the differ- 
ence was tested between the percentage of 
the high and the percentage of the low 
criterion groups rated on each alterna- 
tive. In the medical groups, the samples 
were sufficiently large that it was possible 
to use a critical ratio of 1.96 as the 
standard for scoring. In the remaining 
smaller occupational groups, an alterna- 
tive was deemed scorable if the critical 
ratio were 1.50 or greater. Unitary posi- 
tive weights, indicating that a signifi- 
cantly higher percentage of a high cri- 
terion group than of a low criterion 
group had been rated on a given alterna- 
tive, and unitary negative weights, indi- 
cating the reverse, were assigned the 
scorable alternatives in the tetrads. Al- 
ternatives which were nondiscriminating 
received zero weights. 
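
The weighting rule described above can be sketched in Python as follows. The source does not state which standard-error formula was used for the critical ratio; the unpooled form for the difference between two independent percentages is assumed here, and the function names and counts are hypothetical. Only the thresholds of 1.96 and 1.50 are taken from the text.

```python
import math


def critical_ratio(high_marked, high_n, low_marked, low_n):
    """Critical ratio for the difference between two independent percentages.

    Assumes the unpooled standard error sqrt(p1*q1/n1 + p2*q2/n2).
    """
    p_high = high_marked / high_n
    p_low = low_marked / low_n
    se = math.sqrt(p_high * (1 - p_high) / high_n
                   + p_low * (1 - p_low) / low_n)
    return 0.0 if se == 0 else (p_high - p_low) / se


def unit_weight(cr, threshold):
    """Return +1, -1, or 0 for a tetrad alternative given its critical ratio."""
    if cr >= threshold:
        return 1    # marked significantly more often for the high group
    if cr <= -threshold:
        return -1   # marked significantly more often for the low group
    return 0        # nondiscriminating


# Hypothetical counts: 24 of 33 high-criterion ratees and 18 of 33
# low-criterion ratees were marked on a given alternative.
cr = critical_ratio(24, 33, 18, 33)
weight_medical = unit_weight(cr, 1.96)   # standard used for the medical groups
weight_smaller = unit_weight(cr, 1.50)   # standard used for the smaller groups
```

With these hypothetical counts the critical ratio falls between the two standards, so the same alternative would be scorable in a smaller occupational group but not in a medical group.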


From the scorable tetrads for each 
item-analysis group, approximately the 
best 20 were selected to constitute scor- 
ing keys. In addition, for each of the 
item-analysis groups which had been 
split, a combined sample scoring key 
was developed. This key was composed of 
the best 20 tetrads selected from those 
in which only alternatives having identi- 
cal scoring weights in the matched sam-
ples had been retained. For the group
of nurses rated by nurses, only a com- 
bined sample key was developed since 
the matched samples did not individu- 
ally yield enough scorable items for 
separate keys. 

A total score on the Forced Choice 
section was obtained by summing all 
positively weighted alternatives. A pre- 
vious study indicated that positive 
weights scoring yielded as valid results 
as positive plus negative weights scor- 
ing on three lengths of keys, one of which 
was a twenty-tetrad key. The validity of 
this length of key also compared favora- 
bly with that of the other two key lengths 
studied (5). 

Sections II and III—Job Proficiency 
and Personal Qualifications. Since these 
two sections consisted of ten-point rating 
scales, the same methods of item analysis 
were used for both. First, the discrimina- 
tory capacity of the scales was checked 
by testing the significance of the differ- 
ence in the mean scale values of high and 
of low criterion groups. Of the total 
number of critical ratios computed on 
the Personal Qualifications scales, con- 
sidering all item-analysis groups, 70.6 
per cent (113 out of 160) were significant 
at the .05 level or below. On the single
Job Proficiency scale, seven of the item 
analysis groups yielded significant dif- 
ferences (.05 level or below) between 
upper and lower 27 per cent groups on 
one or both criteria. 

The raw scores of all ratees in each
item-analysis group were used to develop
stanine scores in one-half sigma units,
with the mean scale value equalling a 
stanine of five. The stanine scoring re- 
sulting for matched samples was so simi- 
lar that the split samples were recom- 
bined to lend greater stability to the 
stanine scales. Cross validation of the Job 
Proficiency and Personal Qualifications 
sections appeared unnecessary in view 
of the consistency in scoring from one 
matched sample to another. 

The total score on the Personal Quali-
fications section was the average of the
stanines from the eight rating scales. 
The score on Job Proficiency was the 
stanine value of the rating given a ratee 
in his primary job function. 
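
A minimal Python sketch of stanine scoring in one-half sigma units is given below. It assumes the conventional stanine boundaries (stanine five centered on the group mean, extreme values limited to one and nine) and the use of the population standard deviation; the source specifies only the half-sigma unit and the anchoring of the mean at five. The function names and ratings are hypothetical.

```python
import math
from statistics import mean, pstdev


def to_stanine(raw, group_mean, group_sd):
    """Convert a raw scale value to a stanine of half-sigma width.

    Stanine 5 spans the mean plus or minus one-quarter sigma; results
    are limited to the range 1 through 9.
    """
    z = (raw - group_mean) / group_sd
    return max(1, min(9, math.floor(2 * z + 5.5)))


def personal_qualifications_score(stanines):
    """Total score for the section: the average of the scale stanines."""
    return sum(stanines) / len(stanines)


# Hypothetical ten-point ratings of eight ratees on one scale.
ratings = [2, 4, 6, 6, 6, 6, 8, 10]
m, sd = mean(ratings), pstdev(ratings)
stanines = [to_stanine(x, m, sd) for x in ratings]
```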

While it would have been desirable 
to treat separately each of the 10 func- 
tions listed in the Job Proficiency sec- 
tion, this was not possible because of 
the small number of ratees performing 
each function. All analyses on this sec- 
tion were made without regard to the 
type of work involved. Specific job func- 
tions were not completely masked, how- 
ever, in view of the method used in 
establishing item-analysis groups. For 
example, of the 179 ratees in the public 
health groups who were rated on the 
Work Performance criterion, 138 were 
given ratings on the Experimental Effi-
ciency Report in the primary function
described as “operation in a technical 
or specialized Public Health program.” 

Section IV—Check List. Critical ratios 
were computed to test the significance 
of the difference in the percentages of 
high and of low criterion groups marked 
on each alternative. Items which discrim-
inated between the two groups at the .05
level or below were deemed scorable. 
Considering both criteria and all item- 
analysis groups, 42.3 per cent (186 out 
of 440) of the ratios computed reached 
this level of significance; the scorable
items, however, were not evenly dis-
tributed among the various item-analysis 
groups. 

Considering the separate item-analysis 
groups, the number of scorable items 
was such that it was feasible to develop 
scoring keys against the Work Perform- 
ance criterion in only the medical, pub- 
lic health, and research groups, and 
against the Personality criterion in only 
the medical groups. Since these were 
split groups, the decision was made to 
include in the scoring keys only those 
items that reached the required level of 
significance in both of the matched sam- 
ples rather than to develop scoring keys 
for the separate samples. 

The total score on this section was the 
sum of all positively weighted items 
(those characteristic of a high criterion 
group). 

RESULTS AND INTERPRETATION 


Comparison of Forced Choice 
Scoring Keys 

The Forced Choice section of the 
Experimental Reports from each of the 
split sample item-analysis groups was 
scored by three keys: (a) self scoring, de- 
veloped from item analysis of the sample 
being scored; (b) cross scoring, developed 
from item analysis of the matched sam- 
ple and used for cross validation; and 
(c) combined sample scoring, based on 
alternatives that gave the same scoring 
weights in both samples. 

Validity coefficients based on each 
type of Forced Choice key are presented 
in Table 2 for the matched sample 
groups. Data on the three scoring keys 
are given for secondary Reports as well as 
for the primary Reports used in item 
analysis. From the validity coefficients 
in Table 2, it may first be noted that the 
coefficients were highly similar from one 
matched sample to another. In compari-
sons of matched sample validities, only
two differences significant at the .05
level or below occurred. These were on
primary Reports, Personality criterion,
in the comparison of physicians (1) and
(2)⁴ on the self (rs = .71 vs. .55) and the
combined scoring keys (rs = .69 vs. .54).⁵


TABLE 2

FORCED CHOICE VALIDITY COEFFICIENTS FOR MATCHED SAMPLE GROUPSᵃ

Columns under each criterion: No. Ratees; Self Scoring; Cross Scoring;
Combined Scoring.

[Entries are given in source order. In rows showing fewer than three
coefficients under a criterion, and in the Median r rows, column positions
are not recoverable; superscript markers in the table body are not legible.]

Occupational Group             Work Performance Criterion | Personality Criterion

Primary Reports
Physicians (1)ᶜ                158  .67  .65  .66         | 159  .61  .69
Physicians (2)                 161  .58  .56              | 161  .55  .56  .54
Public health personnel (1)     90  .62  .50  .57         |  99  .53  .56  .58
Public health personnel (2)     88  .61  .59              | 100  .55  .56
Research personnel (1)          66  .67  .63  .69         |  69  .55  .62
Research personnel (2)          65  .67  .62              |  67  .60  .56  .64
Median r                       .65  .61  .58  .56  .60

Secondary Reports
Physicians (1)                 120  .67  .66  .67         | 120  .70  .69
Physicians (2)                 123  .67  .68              | 123  .63  .66  .66
Public health personnel (1)     64  .34                   |  71  .49  .52
Public health personnel (2)     63  .47  .44  .45         |  68  .42  .50  .47
Research personnel (1)          51  .40  .37              |  53  .64  .55  .59
Research personnel (2)          56  .57                   |  58  .50  .60  .58
Median r                       .50  .48  .51  .58  .59

ᵃ All rs not marked are significantly different from zero at the .01 level or below.
ᵇ r is significantly different from zero at the .05 level.
ᶜ (1) = sample 1; (2) = sample 2.
ᵈ r on self scoring is significantly higher at the .01 level than r on cross scoring.
ᵉ r on combined scoring is significantly higher at the .05 level than r on cross scoring.
ᶠ r on self scoring is significantly higher at the .05 level than r on combined scoring.
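
The comparison of the two physician samples on the self key (rs = .71 vs. .55) can be checked with a short Python sketch. It assumes the standard test for two independent correlations, in which each r is transformed to Fisher's z and the difference is divided by sqrt(1/(n1 - 3) + 1/(n2 - 3)); whether this is identical to the formulas cited by the authors is not confirmed here. The function name is hypothetical, and the sample sizes of 159 and 161 are those shown for these groups in Table 2.

```python
import math


def fisher_z_difference(r1, n1, r2, n2):
    """Test the difference between two independent correlations.

    Returns the critical ratio and its two-tailed normal probability.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)          # Fisher r-to-z
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))      # SE of z1 - z2
    ratio = (z1 - z2) / se
    p_two_tailed = math.erfc(abs(ratio) / math.sqrt(2))
    return ratio, p_two_tailed


# Physicians (1) vs. physicians (2), self scoring, Personality criterion.
ratio, p = fisher_z_difference(0.71, 159, 0.55, 161)
```

Under these assumptions the ratio exceeds 1.96, which agrees with the reported significance of this difference at the .05 level.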

Table 2 also shows that the self and 
the combined keys, both of which rep- 
resent the use of scoring keys with item- 
analysis groups, tended to produce higher 
validity coefficients than did scoring keys 
used with independent samples estab- 
lished for cross validation. This trend is 
apparent both from the median correla-
tions and from the tests of differences
in validity from one type of scoring to
another.⁶ It should be noted that the
differences in validity by type of scoring
appeared to decrease when scoring keys
were applied to Reports (secondary) in-
dependent of those used in item analysis.
Only one significant difference in va-
lidity by type of scoring occurred on the
secondary Reports.

⁴ Here, (1) = sample 1; (2) = sample 2.

⁵ For purposes of testing the significance of
the difference in rs, rs were transformed to zs.
Tests of differences reported in this paper in-
volved independent samples and the same sample
with one array in common (4, p. 124, formulas
45, 47, and 49).

⁶ Median correlation coefficients have been
presented in tables merely to aid the reader in
observing trends in the data. They are not in-
tended to be precise summary statistics since an
assumption that the various officer groups were
samples drawn from a common population is
not warranted.

Correlations (not shown in the table)
among the three types of scoring keys
were high. Considering coefficients based
on both primary and secondary Reports,
the median correlation between the self
and the cross keys was .92, between the
self and the combined, .96, and between
the cross and the combined, .96. Both
the self and the cross scoring keys cor-
related highly with the combined key
since the latter was composed of the
tetrad alternatives scored in both
samples.

It was to be expected that the self and
the combined keys would yield the
higher validity coefficients since they
were used to score item-analysis Reports.
However, when the self keys were ap-
plied as cross keys to Experimental Re-
ports independent of those used in item
analysis, the validities exhibited surpris-
ingly little decrease in size. Out of 24
possible comparisons of the self and the
cross keys, only three (12.5 per cent)
were significant at the .05 level or be-
low. Although cross-validation data were
not available for the combined keys, it
seems likely that in successive samples
they would have greater stability than
keys derived from item analysis of Re-
ports completed on a single sample. For
this reason, subsequent discussion of the
Forced Choice section of the Experi-
mental Report will be based only on
data from the combined sample scoring
keys. In order to increase the reliability
of the statistics based on these keys,
matched samples have been recombined.


Validity of the Experimental Report


Table 3 presents the validity coeffi-
cients based on the Forced Choice (FC),
Job Proficiency (JP), Personal Qualifi-
cations (PQ), and Check List (CL) sec-
tions of the Experimental Report.⁷ To
facilitate interpretation of the correla-
tions, the factors which may be influenc-
ing variations in the data will be indi-
vidually considered.


TABLE 3

VALIDITY COEFFICIENTS FOR ALL SECTIONS OF THE EXPERIMENTAL EFFICIENCY REPORTᵃ

Columns under each criterion: No. Ratees; FC; JP; PQ; CL.

[The table body is only partly legible in the source. Legible entries are
given in source order; column positions and superscript markers are not
recoverable.]

Occupational Group             Work Performance Criterion | Personality Criterion

Primary Reports
Physicians                     319  .61  .62  .58  .54    | 320  .62  .40  .50  .46
Public health personnel        178  .58  .49  .49  .54    | 199  .29  .36
Research personnel             131  .65  .44  .39  .59    | 136  .63
Nurses rated by nurses         111
Nurses rated by physicians
Dentists
Median r                       .54

Secondary Reports
Physicians                     243  .68
Public health personnel        .49  .50
Research personnel             .49  .58
Nurses rated by nurses
Nurses rated by physicians     .34  .45
Dentists                       .48  .64  .50
Median r                       .39  .55

Other legible entries, in source order:
.43
.40  .33
.50  .26  .36  .40
.52  .58  .56
.32  .39
.10  .27
.35  .49

ᵃ All rs not marked are significantly different from zero at the .01 level or below.
ᵇ r is significantly different from zero at the .05 level.
ᶜ r does not reach the .05 level of significance.

Report sections. From the median cor- 
relations in Table 3, it appears that the 
validity coefficients for the Forced Choice 
and Check List sections were higher than 
those obtained for the Job Proficiency 
and the Personal Qualifications sections. 
More specific comparisons of those 
Experimental Report sections which 
showed significant differences in valid- 
ity may be seen in Table 4. 


Out of the 108 possible comparisons of Ex- 
perimental Report sections, 36 (33.3 per cent) 
were significant at the .05 level or below. In
27 of the 36 significantly different pairs of co- 
efficients, the Forced Choice section exhibited 
higher validities. In only one instance was the 
validity of the Forced Choice section significantly 
lower than that of another section. The Job 
Proficiency and Personal Qualifications scales 
produced the lowest coefficients; in 15 instances, 
each of these sections yielded a validity co- 
efficient which was significantly lower than that 
of another section. The Job Proficiency scale 
in only three instances and the Personal Quali- 
fications section in only two instances produced 
validities which were significantly higher than 
those of other Report sections. The Check List 
in four comparisons yielded a significantly
higher coefficient than another section, and in 
five comparisons, a significantly lower coefficient. 
In four of the five instances in which the Check 
List produced a lower coefficient, it was com- 
pared with the Forced Choice section of the 
Report. 


In general, the Forced Choice section gave the highest validity
coefficients. Comparisons of validities on the remaining three sections
of the Experimental Report yielded relatively few significant
differences although the Check List, where available, tended to produce
somewhat higher validities than did the Job Proficiency and Personal
Qualifications sections.

Footnote: Reports in the dental group were scored by the key developed
on physicians. An exploratory validation study showed as high validities
for dentists as for physicians when the medical key was used to score
reports in both groups.


TABLE 4

COMPARISONS OF SECTIONS OF THE EXPERIMENTAL REPORT IN WHICH VALIDITY
COEFFICIENTS DIFFERED SIGNIFICANTLY

Columns: Occupational Group; Report Sections Compared(a) on Primary
Reports; Report Sections Compared(a) on Secondary Reports.

Rows: occupational groups (Physicians, Public health personnel, Research
personnel, Nurses rated by nurses, Nurses rated by physicians, Dentists),
listed under the Work Performance Criterion and again under the
Personality Criterion.

[The individual entries (pairs such as "FC vs. JP", "FC vs. PQ",
"FC vs. CL", "CL vs. FC", "JP vs. CL", "CL vs. PQ", "JP vs. PQ") survive
only as fragments in the source; their row and column positions are not
recoverable.]

a The rating section on which the higher validity was obtained is listed
first.

b Validity coefficients for the sections compared differ significantly
at the .01 level or below.

c Validity coefficients for the sections compared differ significantly
at the .05 level.



TABLE 5

COMPARISONS OF OCCUPATIONAL GROUPS IN WHICH VALIDITY COEFFICIENTS
DIFFERED SIGNIFICANTLY

Columns: Groups Compared(a), with the Report section involved, given
separately under the Work Performance Criterion and the Personality
Criterion, for Primary Reports and for Secondary Reports.

[The individual group comparisons survive only as fragments in the
source; their positions in the table are not recoverable.]

a The group in which the higher validity was obtained is listed first.

b Validity coefficients for the groups compared differ significantly at
the .01 level or below.

c Validity coefficients for the groups compared differ significantly at
the .05 level.


Occupational group. Using the coefficients in Table 3, the significance
of the difference between the validity coefficients obtained in one
occupational group and those obtained in another was also tested. The
comparisons yielding significant differences are summarized in Table 5.
It should be mentioned that tests of differences were made only between
independent occupational groups. The two nursing groups, which
overlapped in membership, were not compared since the amount of
computational work involved did not seem warranted by the relatively
small differences in validity that occurred in most instances between
the two groups.

Out of the 182 group comparisons made, Table 5 shows that 44 (24.2 per
cent) yielded differences significant at the .05 level or below. It is
rather striking that in 3g of the significant comparisons, the medical
group produced the higher validity coefficient, while in no instance did
it produce a significantly lower one. In 25 of the significant
comparisons, the lower coefficient occurred in one of the two nursing
groups. Coefficients in the public health, research, and dental groups
tended to be of about the same magnitude, differing in some instances
from the two extreme groups of physicians and nurses. No significant
differences were found between the public health and the research
groups; only one significant difference occurred in the comparisons of
dentists and research personnel, and two in the comparisons of dentists
and public health personnel.
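The report does not say which test was used to compare validity
coefficients from independent groups. A common choice for two
correlations from independent samples is Fisher's r-to-z transformation,
and the sketch below assumes that test. The function names and the input
values (r = .61 with N = 319, r = .42 with N = 150) are illustrative
assumptions, not figures attributed to a particular table entry.

```python
import math


def fisher_z(r):
    """Fisher r-to-z transformation of a correlation coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))


def independent_r_difference(r1, n1, r2, n2):
    """Two-tailed z test for the difference between two correlations
    obtained on independent samples (assumed method, see lead-in)."""
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (fisher_z(r1) - fisher_z(r2)) / standard_error
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_tailed


# Illustrative (hypothetical) inputs: a larger group with the higher r.
z, p = independent_r_difference(0.61, 319, 0.42, 150)
print(f"z = {z:.2f}, two-tailed p = {p:.4f}")
```

Because the test assumes independent samples, it would not apply to the
two nursing groups, which overlapped in membership.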


The higher validities in the physicians 
group are perhaps due to factors intrinsic 
in the work situation. Such factors may
be supervisors’ relatively greater oppor- 
tunity to observe carefully the work of 
medical personnel, particularly interns 
or lower grade officers under close super- 
vision, and to develop evaluation stand- 
ards and rating experience since physi- 
cians constitute the largest professional 
group in the Public Health Service. 

Another possible explanation of the validity coefficients obtained for
the physicians group is that, in item analysis, a higher critical ratio
was used as the standard for scoring in this group than in the other
occupational groups. However, a recent study would seem to indicate that
a stringent requirement for the level of discrimination of individual
items does not necessarily increase total validity (2).

Criteria. Inspection of the validity 
coefficients (Table 3) from one criterion 
to another within the same level of 
supervisor shows that on all but the 
Forced Choice section higher validities 
occurred on Work Performance than on 
Personality. On the Forced Choice sec- 
tion, as high or higher validities were 
obtained on the Personality as on the 
Work Performance criterion for both 
levels of reporting supervisor, and for 
all occupational groups except the den- 
tists, and the nurses rated by physicians. 

A possible explanation of the observed 
differences in validity by criteria may be 
that a supervisor, if able to ascertain 
what constitutes a “good” and a “poor”
rating (as is the case for rating scales 
and check lists), can more objectively 
evaluate the ratee on observable work- 
performance characteristics than he can 
on personal characteristics. When good 
and poor ratings are not so readily dis- 
cernible, as presumably is the case with 
the forced choice type of evaluation, the 
objectivity of evaluations is perhaps in- 
creased so that they are as valid measures 
of the factors involved in a Personality 
as in a Work Performance criterion. 

Level of supervisor. The validity coeffi- 
cients in Table 3 may also be compared 
by level of supervisor. Since the primary 
Reports were used in item analysis, it 
is to be expected that they would yield 
higher validities than the secondary Re- 
ports. The median correlations in the 


table, however, indicate that validity 
held up surprisingly well on secondary 
Reports; the median correlations on the 
Personality criterion were even some- 
what higher on the secondary than on 
the primary Reports. Considering the 
42 pairs of coefficients which can be com- 
pared from one level of supervisor to 
another, primary Reports produced the 
higher validity in 21 comparisons, and 
secondary Reports produced the higher 
validity in the same number of comparisons. The median difference in
coefficients was .07 for those comparisons in which primary Reports
produced higher correlations, and .08 for those comparisons in which
secondary Reports gave higher validities.
Differences in validity by level of 
supervisor, however, were observable 
within specific occupational groups. In 
75 per cent or more of the comparisons 
of coefficients within the public health 
and the nurses-rated-by-nurses groups, 
the higher validity occurred on primary 
Reports. Since the primary Reports were 
used in item analysis, this finding is in 
the expected direction but in addition 
possibly reflects the fact that the pri- 
mary, more immediate, supervisors of 
these groups are more likely to be in 
the ratees’ profession than are the sec- 
ondary supervisors. In 75 per cent or 
more of the comparisons within the 
medical and dental groups, the higher 
coefficients occurred on secondary Re- 
ports. It should be mentioned that hos- 
pitals and outpatient clinics are admin- 
istered by a medical officer, and dental 
services are headed by a dental officer. 
For this reason, both the secondary and 
the primary supervisors of physicians 
and dentists are likely to be in either 
the same profession as the ratees or the 
same as the raters who performed criterion ratings. Further, secondary
supervisors of physicians and dentists are likely to be officers who
routinely review efficiency reports and, therefore, have more
information available as a basis for evaluating a ratee than do the
primary supervisors.

In general, then, Reports completed 
independently by a second group of 
supervisors produced validities that com- 
pared favorably with those based on the 
Reports used in item analysis. It is likely 
that, had the secondary supervisors been 
at the same administrative level as the 
primary, differences in validity by level 
of supervisor which were apparent in 
specific occupational groups would have 
tended to occur less often; that is, the 
validities would be more nearly the same 
in all groups than occurred in the pres- 
ent data. 

Ratee grade. A control on ratee grade 
was used both in the administration of 
criterion rating forms and in the estab- 
lishment of occupational groups for pur- 
poses of item analysis. Nonetheless, since 
it was not feasible to control grade more 
precisely, this factor may be operating 
to increase spuriously the correlations 
shown in Table 3. In order to check this
possibility, validity coefficients by grade 
were computed; small numbers of ratees 
made it necessary in some instances, 
however, to combine adjacent grades 
such as the senior and the director. Cor- 
relations by grade are presented in Table 
6. 

If grade were a systematic factor affect- 
ing both the criterion and Experimental 
Report variables, it would be expected 
that the validities based on all grades 
would be higher than the individual 
grade validities. That this is not the case 
may be seen from Table 6. There does 
not appear to be any consistent trend 
in the correlations as a function of grade 
level. The effect of combining grades was 


the masking of the higher validity ob- 
tained for some specific grades. In only 
one instance, in the nurses-rated-by- 
physicians group, was the correlation for 
all grades higher than any of the validity 
coefficients for the individual grades. 
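The reasoning above can be illustrated with a small constructed example:
if grade raised both the Report score and the criterion score, pooling
grades would produce a correlation higher than the correlation within
any single grade. The scores below are invented for illustration only,
and the helper name pearson_r is an assumption, not part of the study.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores: within each grade the Report score is unrelated
# to the criterion, but the higher grade scores higher on both.
lower_report, lower_criterion = [1, 2, 3], [2, 1, 2]
upper_report, upper_criterion = [5, 6, 7], [6, 5, 6]

r_lower = pearson_r(lower_report, lower_criterion)
r_upper = pearson_r(upper_report, upper_criterion)
r_pooled = pearson_r(lower_report + upper_report,
                     lower_criterion + upper_criterion)
print(r_lower, r_upper, round(r_pooled, 2))
```

In this constructed case the pooled correlation is large although both
within-grade correlations are zero, which is the pattern the authors
checked for and did not find in Table 6.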


The significance of the difference in validity 
coefficients from one grade to another was tested. 
The relatively few comparisons, 16 out of a 
possible 122 (13.1 per cent), which yielded differences
significant at the .05 level or below are
shown in Table 6. No differences occurred in 
the public health group, and only two were 
found in the group of nurses rated by physicians. 

In the medical group, the significant differ- 
ences tended to involve higher validities in the 
Assistant grade as compared with other grades. 
Out of the 14 significant differences found in 
this professional group, 10 occurred in com- 
parisons in which the higher coefficient was in 
the Assistant grade. This finding may be due 
to the fact that the majority of Assistant grade 
physicians are interns under close supervision; 
the supervisors of interns are experienced raters 
who have ample opportunity to observe the 
interns’ performance. 

The remaining four significant differences in 
the medical group occurred on the Reports 
completed by the secondary supervisors, Work 
Performance criterion, in the comparison of the 
combined Senior and Director grade with the 
Senior Assistant grade. The unusually high va- 
lidities in the combined Senior and Director 
grade may have been due to the small number 
of ratees (about half the number available on 
primary Reports) and perhaps to a selective 
factor that resulted in the designation of highly 
experienced raters as the secondary supervisors 
for this grade. 


Although differences in the validity 
coefficients found for the various ratee 
grades did occur, they were presumably 
the result of certain identifiable influ- 
ences. The grade factor, as such, does 
not appear to have been operating in 
any systematic manner which spuriously 
increased the validities based on all 
grades. 


Reliability of the Experimental Report 


TABLE 6

VALIDITY COEFFICIENTS FOR THE EXPERIMENTAL EFFICIENCY REPORT BASED ON
SEPARATE GRADES

Columns: Occupational Group; Grade; then N and the FC, JP, PQ, and CL
coefficients under the Work Performance Criterion and under the
Personality Criterion, given for Primary Reports and for Secondary
Reports.

Rows: Nurses rated by physicians (Assistant, Senior assistant, All
grades(a)); Physicians (Assistant, Senior assistant, Full, Senior &
Director, All grades); Public health personnel (Full, Senior, All
grades).

[The coefficients in the table body survive only as displaced fragments
in the source and cannot be assigned to rows and columns.]

a The N for All Grades is based on all available cases including those
in grades too small for the separate grade analysis.

b r is significantly higher at the .01 level than the underlined r in
the same column within the same report within the same officer group.

c r is significantly higher at the .05 level than the underlined r in
the same column within the same report within the same officer group.


Rater agreement. Correlations between scores from primary and secondary
Reports, shown in Table 7, provide one kind of measure of the
reliability of the Experimental Report.

Correlations between Reports completed by the two groups of supervisors
ranged from .27 to .65; over half were .55 or higher. Only the lowest
correlation, .27, failed to reach the .01 level of significance; it was
significant at the .05 level. It should be noted that the Job
Proficiency (JP) section, which consisted of a single rating scale,
tended to produce the lowest correlations. As the median rs indicate,
the Forced Choice (FC) section tended to yield the highest correlations,
although for the Work Performance criterion these did not differ
markedly from those produced by the Personal Qualifications (PQ) and
Check List (CL) sections. For the Personality criterion, the Forced
Choice section gave the highest correlations in all instances, although
the two correlations available on the Check List were of comparable
size.

Since the primary and the secondary supervisors were at different
administrative levels, the correlation coefficients are lower than might
be obtained between Reports completed by different supervisors at the
same level or between Reports completed by the same supervisor on two
different occasions. Considering the factors operating to lower the
coefficients, the correlations between scores on primary and secondary
Reports, as measures of rater agreement, are fairly high.


TABLE 7

CORRELATIONS BETWEEN SCORES FROM PRIMARY AND SECONDARY REPORTS(a)

Columns: Occupational Group; then No. Ratees and the correlations for
the Report sections under the Work Performance Criterion and under the
Personality Criterion.

Rows: Physicians; Public health personnel; Research personnel; Nurses
rated by nurses; Nurses rated by physicians; Dentists; Median r.

[The individual correlations are not legible in the source. The Median r
row reads, as far as legible: .58, .45, .54, .56 under the Work
Performance Criterion; .63, .49 under the Personality Criterion.]

a All rs not marked are significantly different from zero at the .01
level or below.

b r is significantly different from zero at the .05 level.


Spearman-Brown estimates. From primary Reports, scored by the key
developed against the Work Performance criterion, correlations between
the odd and the even alternatives in each of three rating sections were
corrected for length by the Spearman-Brown formula. A previous paper has
reported that for Forced Choice tetrads, Spearman-Brown estimates of
reliability were fairly close approximations of empirical reliabilities
(5). Spearman-Brown estimates based on the present data are shown in
Table 8. Since the Job Proficiency section involved only a single rating
scale, it was not possible to compute a split-half coefficient on this
part of the Report.

Considering the length of the various sections of the Report, the
reliability coefficients are in the high range. As can be seen from
Table 8, the number of scored alternatives on the Forced Choice and
Check List sections varied somewhat for the different occupational
groups. The number of rating scales in the Personal Qualifications
section was the same (eight) for all groups.


TABLE 8

RELIABILITY COEFFICIENTS FOR THREE SECTIONS OF THE EXPERIMENTAL
EFFICIENCY REPORT

Columns: Occupational Group; No. Ratees; Forced Choice (No. Scored
Tetrads, No. Scored Alternatives(b), r, r₁₁); Personal Qualifications(a)
(r, r₁₁); Check List (No. Scored Alternatives, r, r₁₁).

Rows: Physicians; P. h. personnel; Res. personnel; Nurses rated by
nurses; Nurses rated by phys.; Dentists; Median r₁₁.

[The table body is not legible in the source. The numbers of scored
alternatives and the reliabilities are repeated in Table 9.]

a The number of rating scales was eight, with the possible total stanine
score ranging from 8 to 72.

b The number of scored alternatives rather than ... was used as the
basis for computing reliability coefficients.


TABLE 9

COMPARISON OF EMPIRICAL VALIDITY COEFFICIENTS WITH VALIDITIES PREDICTED
ON THE BASIS OF AN INCREASE IN LENGTH OF SCORING KEY

                                     No. Scored          Empirical  Estimated    Limit of
Occupational Group          Section  Alternatives  r₁₁   Validity   Validity(c)  Validity(d)

Physicians                  FC       28            .91   .61        ..           ..
                            PQ        8            .94   .59        .60          ..
                            CL       12            ..    .54        .56          .59
Public health personnel     FC       30            ..    .58        ..           .61
                            CL        7            .80   .54        .59          .60
Research personnel          FC       22            .88   .65        ..           .69
                            PQ        8            ..    .39        .40          .40
                            CL       ..            ..    ..         .63          .66
Nurses rated by nurses      FC       29            .88   .42        ..           .45
                            PQ        8            .96   .39        .40          .40
Nurses rated by physicians  FC       23            .78   .44        ..           .50
                            PQ        8            .97   .43        .43          .44
Dentists                    FC       28            ..    ..         ..           .56
                            PQ        8            .95   .55        .56          .56
                            CL       12            .92   .50        .51          ..

[Cells shown as ".." are blank or illegible in the source; the Personal
Qualifications row for public health personnel is missing. Footnote
marks a and b and the underlining of individual coefficients are not
legible and are omitted.]

a r is significantly higher at the .01 level than the underlined r in
the same column within the same officer group.

b r is significantly higher at the .05 level than the underlined r in
the same column within the same officer group.

c Estimated validity based on the same number of scored alternatives as
the Forced Choice section.

d Limit of validity if report section were made infinitely long.


The highest Spearman-Brown estimates of reliability occurred on the
Personal Qualifications (PQ) scales; all r₁₁'s were .91 or higher.
Coefficients on the Check List (CL) ranged from .80 to .92, with a
median of .83. The Forced Choice (FC) section yielded satisfactory
reliabilities in all officer groups except the nurses-rated-by-physicians
(r₁₁ = .78); all other coefficients on this section were .88 or .91,
with a median of .90.
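The Spearman-Brown correction described above can be sketched as
follows. The general form of the formula is standard; the function name
and the half-test correlations used as inputs are illustrative
assumptions rather than values quoted from Table 8.

```python
def spearman_brown(r_part, factor=2.0):
    """Spearman-Brown estimate of the reliability of a measure made
    `factor` times as long as the part on which r_part was computed.
    With factor=2 this is the usual split-half (odd-even) correction."""
    return factor * r_part / (1 + (factor - 1) * r_part)


# Illustrative odd-even correlations (hypothetical inputs).
for r_half in (0.64, 0.83):
    print(r_half, "->", round(spearman_brown(r_half), 2))
```

An odd-even correlation of .64 steps up to about .78 and one of .83 to
about .91, which is the order of magnitude of the Forced Choice
reliabilities reported in the text.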


Validity as Related to Length of Scoring 
Key 
Since the sections of the Experimental Report were not equated in
length, the fact that the Forced Choice was the longest section may
account for its higher validity. The effect of length of scoring key on
validity was tested on the Experimental Report sections for which both
validity and reliability data were available from the primary Reports
scored by keys developed on the basis of the Work Performance criterion.
The results of the tests are shown in Table 9, which presents: (a)
empirical validities and reliabilities and the number of scored
alternatives in each Report section, repeated from previous tables for
ease of comparison; (b) estimated validities for the Personal
Qualifications (PQ) and the Check List (CL) sections, based on an
increase in length of these sections to that of the Forced Choice (FC);
and (c) the maximum validity that theoretically could be obtained on
each section if it were made infinitely long (1, p. 166).
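The source cites (1, p. 166) for these estimates without printing the
formulas. The sketch below assumes the standard psychometric expressions
for the validity of a lengthened measure and for its limit at infinite
length; the function names are assumptions. As a check, the formulas are
applied to the Table 9 row with 7 scored alternatives, reliability .80,
and empirical validity .54, lengthened to the 30 alternatives of the
Forced Choice section in the same group.

```python
import math


def lengthened_validity(validity, reliability, factor):
    """Validity expected when a measure with the given validity and
    reliability is lengthened `factor` times (assumed standard formula)."""
    return (validity * math.sqrt(factor)
            / math.sqrt(1 + (factor - 1) * reliability))


def limit_of_validity(validity, reliability):
    """Validity approached as the measure is made infinitely long."""
    return validity / math.sqrt(reliability)


# Row from Table 9: 7 scored alternatives, r11 = .80, validity = .54,
# lengthened to 30 scored alternatives.
estimated = lengthened_validity(0.54, 0.80, 30 / 7)
limit = limit_of_validity(0.54, 0.80)
print(round(estimated, 2), round(limit, 2))
```

Under these assumed formulas the row gives an estimated validity of
about .59 and a limit of about .60, in agreement with the entries
printed for that row.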

From Table 9, it may be seen that the
Forced Choice section produced the 
highest empirical validity in all but one 
occupational group, the dental. Theo- 
retically increasing the length of the 
other Report sections to that of the 
Forced Choice resulted in one additional 
group, public health personnel, in which 
the validity (estimated) of a section other 
than the Forced Choice was the highest. 
It is likely, however, that had the Re- 
port sections been equated for length, 
the three significant differences in validity
(Table 9, footnotes a and b) that
were obtained from one Report section 
to another would still have occurred. 

If each of the Report sections were made infinitely long, it is apparent
from a comparison of obtained validities with the theoretical limits of
validity in Table 9 that the Personal Qualifications (PQ) section could
be expected to show the smallest increase in validity (no more than
.02). On the Forced Choice (FC) section, however, the increase to be
expected ranges from .03 to .06, and on the Check List (CL), from .02 to
.07.

While length of the various Report 
sections appears to have affected the mag- 
nitude of the validity coefficients, the 
greater number of scored alternatives in 
the Forced Choice (FC) section does not 
appear primarily responsible for the gen- 
erally higher validity of this section. Al- 
though the Check List (CL) was the sec- 
tion most affected by the small number 
of scored alternatives, the evidence on 
limit of validity seems to indicate the 


relatively higher validity of the Forced 
Choice section. 


Multiple Correlations 


In order to determine the combination 
of Experimental Efficiency Report sec- 
tions which would, for each occupational 
group, maximally predict the separate 
criteria, multiple correlation coefficients 
based on primary Reports were com- 
puted by the Wherry-Doolittle method 
of test selection (4, Chap. XIV). The in- 
tercorrelations on which the multiple 
correlation work was based are shown in 
Table 10. As can be seen from Table 10, 
the intercorrelations among Report sec- 
tions were quite high, particularly in the 
medical and dental groups. In all occupa- 
tional groups, the highest intercorrela- 
tions tended to occur between the Job 
Proficiency (JP) and Personal Qualifica- 
tions (PQ) sections, with correlations 
ranging from .66 to .85. The Forced 
Choice (FC) and the Check List (CL) 
sections were also highly related, with a 
range in correlations from .68 to .82. 

The multiple correlational data are 
presented in Table 11, which shows the 
Rs based on selected predictors, the Rs 
obtained by application of the Wherry 
shrinkage formula, the validity coefficient 
of the best predictor in each occupa- 
tional group, and the beta weights of 
predictors in the order in which the pre- 
dictors were selected. 
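The shrinkage correction mentioned above can be sketched as follows. The
source does not print the formula, and textbooks differ on the exact
denominator attributed to Wherry (N - k - 1 or N - k); the version below
assumes N - k - 1, where N is the number of ratees and k the number of
selected predictors. The function name and the input values are
illustrative assumptions.

```python
import math


def shrunken_r(r_multiple, n_cases, n_predictors):
    """Multiple R corrected for shrinkage (assumed form:
    1 - (1 - R^2)(N - 1)/(N - k - 1), floored at zero)."""
    r2_adjusted = 1 - (1 - r_multiple ** 2) * (n_cases - 1) / (
        n_cases - n_predictors - 1)
    return math.sqrt(max(r2_adjusted, 0.0))


# Hypothetical example: R = .66 from two predictors on 319 ratees.
print(round(shrunken_r(0.66, 319, 2), 3))
```

With samples of this size and only two or three predictors the
correction is small, so the shrunken R stays close to the obtained R.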

TABLE 10

INTERCORRELATIONS AMONG SECTIONS OF THE PRIMARY REPORTS(a)

Columns: Occupational Group; then N and the intercorrelations FC-JP,
FC-PQ, FC-CL, JP-PQ, JP-CL, and PQ-CL under the Work Performance
Criterion and under the Personality Criterion.

Rows: Physicians; P. h. personnel; Res. personnel; Nurses rated by
nurses; Nurses rated by phys.; Dentists.

[The intercorrelations in the table body are not legible in the source.]

a All rs are significantly different from zero at the .01 level or
below.


As Table 11 indicates, all Rs were significantly different from zero at
the .01 level or below. Within each occupational group, the increase in
validity as a result of using a team of Report sections rather than a
single predictor may be seen by comparing the R with the r of the first
selected predictor. However, since the Wherry-Doolittle method of test
selection does not guarantee that the increment due to the selection of
successive variables significantly increases validity, the null
hypothesis was tested by use of the F ratio (3, p. 55). Tests for the
significance of the difference in the R² (or r²) based on the first
selected predictor and the R² based on all selected predictors showed
that, on the Work Performance criterion, validity was significantly
increased at the .05 level or below in all groups except the dental. On
the Personality criterion, a significant increase at the .05 level or
below occurred in only the largest group, the medical.
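The F ratio for an increment in R² is not written out in the source,
which cites (3, p. 55). The sketch below assumes the usual form of that
test for nested sets of predictors; the function name and the input
values (r = .62 for the first predictor, R = .66 for two predictors,
N = 319) are illustrative assumptions.

```python
def f_for_r2_increment(r2_full, r2_reduced, n_cases, k_full, k_reduced):
    """F ratio testing whether adding predictors raises R^2
    (assumed standard form; compare with tabled F on df1 and df2)."""
    df1 = k_full - k_reduced
    df2 = n_cases - k_full - 1
    f_ratio = ((r2_full - r2_reduced) / df1) / ((1 - r2_full) / df2)
    return f_ratio, df1, df2


# Hypothetical example: first predictor alone versus two predictors.
f_ratio, df1, df2 = f_for_r2_increment(0.66 ** 2, 0.62 ** 2, 319, 2, 1)
print(round(f_ratio, 1), df1, df2)
```

An F ratio of this size on 1 and 316 degrees of freedom would exceed
the tabled value at the .05 level; a small group with the same
correlations could fail to reach it, which is consistent with the
pattern the authors report for the smaller groups.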

As was previously noted in the discus- 
sion of the validity coefficients in Table 
3, the data in Table 11 indicate that the 
validities obtained against the Work 
Performance criterion were higher than 
those on Personality, and that the physi- 
cians group was the one in which predic- 
tion was best. 

The relative effectiveness of the four 
Report sections as measures of per- 
formance within each occupational 
group is evident from the results of the 
test selection. The predictors in Table 
11 are those which, as determined by 
the Wherry-Doolittle method, combine 
to produce maximum multiple correla- 
tions. Inspection of the order in which 
predictors were selected and of the num- 
ber of occupational groups in which they 
were selected shows that, considering 
both criteria, the Forced Choice (FC) 
section was a selected predictor in eleven 
of the twelve groups. Further, the Forced 
Choice section was the first selected pre- 
dictor in nine of the groups. The Per- 
sonal Qualifications (PQ) section was the 
next most frequently selected Report 
section occurring in eight of the officer 
groups, while the Job Proficiency (JP) 
section and the Check List (CL) each 
occurred in three groups. In general, the 
Forced Choice section in combination 
with one of the rating scale sections, 
usually Personal Qualifications, tended 
to produce the maximum correlations 
with the criteria. 

Since the multiple correlational work
was based on validity coefficients ob- 
tained for item-analysis samples, the 
findings concerning the relative effec- 
tiveness of the various Report sections 
and the sizes of the multiple correlation 
coefficients require verification on inde- 
pendent samples. Evidence from the 
“cross” scoring keys, however, indicated 
that little decrease is to be expected in 
cross validation of the Forced Choice 
section of the Report. Further, cross 
validation of the Personal Qualifications 
and Job Proficiency scales was not 
deemed necessary since matched samples 
had produced highly similar scoring 
keys. In view of these considerations, it 
is likely that the validity coefficients will 
not show a marked decrease in subse- 
quent samples. 


Validity of the Officer’s Progress Report 


Validity coefficients based on the Rat- 
ing Scales (RS) and the Narrative Com- 
ments (NC) sections of the Officer's 
Progress Report and on sections of the 
Experimental Report completed by the 
primary supervisors are shown in Table 
12. Considerable attrition in the number 
of Experimental Reports occurred as a 
result of using only those ratees on whom 
both Reports were available. The 
median correlations for the Experi- 
mental Report shown in Table 12, how- 
ever, are about the same size as the corre- 
sponding median validities in Table 3. 

Tests of the significance of the difference in the validity coefficients
in Table 12 from one Report section to another revealed that
coefficients in 20 per cent (30 out of a possible 150) of the
comparisons differed significantly at the .05 level or below. The
specific comparisons which produced significant differences are shown in
Table 13. The percentage of significant comparisons was less than
occurred in the tests of differences on the Experimental Report (see
Table 4); this was probably due to the smaller number of cases available
on the combined Reports.

As was previously found, the Forced Choice (FC) section produced more
(18 out of 30) of the significantly higher validity coefficients than
any other Report section. In no instance was a validity coefficient on
the Forced Choice section significantly lower than that of another
section. The number of comparisons in which each of the remaining five
Report sections produced a validity significantly higher than another
section ranged from none on the Check List to five on the Personal
Qualifications section.


TABLE 12

VALIDITY COEFFICIENTS FOR THE EXPERIMENTAL REPORT AND THE OFFICER'S
PROGRESS REPORT(a)

Columns: Occupational Group; then the coefficients for the Experimental
Report (Primary Reports) and for the Officer's Progress Report under the
Work Performance Criterion and under the Personality Criterion.

Rows: Physicians; P. h. personnel; Res. personnel; Nurses rated by
nurses; Nurses rated by phys.; Dentists; Median r.

[The coefficients in the table body are not legible in the source.]

a Based on the number of officers ... Experimental and Progress Reports
were available.

b ... different from zero at the .01 level or below.

c r is significantly different from zero at the .05 level.


TABLE 13

COMPARISONS OF SECTIONS OF THE EXPERIMENTAL REPORT AND THE OFFICER'S
PROGRESS REPORT IN WHICH VALIDITY COEFFICIENTS DIFFERED SIGNIFICANTLY

Columns: Occupational Group; Report Sections Compared(a) under the Work
Performance Criterion | Report Sections Compared(a) under the
Personality Criterion. Entries printed without a "|" stand alone on
their line in the source, and their column is not recoverable.

Physicians                   FC vs. CL | FC vs. CL
                             JP vs. PQ | PQ vs. CL
                             JP vs. CL
                             JP vs. RS
                             JP vs. NC
Public health personnel      FC vs. JP | FC vs. JP
                             FC vs. ... | FC vs. PQ
                             FC vs. RS
                             FC vs. NC
Research personnel           FC vs. JP | FC vs. JP
                             FC vs. PQ | FC vs. PQ
                             FC vs. RS | FC vs. RS
                             NC vs. PQ | FC vs. NC
                             PQ vs. JP
Nurses rated by nurses       FC vs. NC | FC vs. NC
                             PQ vs. ...
                             PQ vs. NC
Nurses rated by physicians   FC vs. NC
                             RS vs. JP
                             ... vs. NC
                             RS vs. NC

[Entries shown with "..." are truncated in the source. Footnote marks b
and c on individual entries are not legible and are omitted.]

a The Report section on which the higher validity occurred is listed
first.

b Validity coefficients for the sections compared differ significantly
at the .01 level or below.

c Validity coefficients for the sections compared differ significantly
at the .05 level.


From the results of the significance of
difference tests and from the median
validity coefficients, it is interesting to
note that the two sections of the Progress
Report, Rating Scales (RS) and Narrative
Comments (NC), produced validities
that compare favorably with all sections
of the Experimental Report except the
Forced Choice.

The data on the Officer’s Progress Re- 
port again suggest the relative superi- 
ority of the forced choice type of evalu- 
ation as compared with more conven- 
tional rating methods. However, since 
the Progress Report was completed 
under operational rather than experi- 
mental conditions, no attempt will be 
made to compare the two Reports by use 
of multiple correlational techniques. It 
is anticipated that in a later study, it 
will be possible to collect data on the 
Progress Report along with cross-valida- 
tional data for the Experimental Report 
so that a more intensive comparison of 
the two Reports can be made. 


SUMMARY 

This study has compared the relative 
efficacy of the forced choice technique 
with other more conventional evalua- 
tion methods as measures of the per- 
formance of professional health person- 
nel working as commissioned officers in 
the United States Public Health Service. 

Four sections of an Experimental 
Efficiency Report were studied: (a) 50 
Forced Choice tetrads adapted from 
items developed by the Department of 
the Army; (b) a ten-point scale for rat- 
ing a ratee’s Job Proficiency in his pri- 
mary job function; (c) eight ten-point 
scales for the evaluation of Personal 
Qualifications; and (d) a twenty-two-item 
Check List developed from comments 
appearing in the Officer’s Progress Re- 
port, the efficiency report in operational 
use in the Service. In addition, two sec- 
tions from the Officer’s Progress Report 
were available for comparison with those 
in the Experimental Report: (a) eleven
five-point Rating Scales for evaluating
various aspects of performance in the
Public Health Service; and (b) Narra-
tive Comments coded and scored by a
method previously developed.

The criteria of Service performance 
were twenty-point graphic rating scales 
for the evaluation of Work Performance 
and Personality. A ratee’s criterion score 
was the average of the ratings given him 
by his work associates on each criterion. 
The results of the study have shown that: 

1. The Forced Choice section of the 
Experimental Report was highly effec- 
tive for evaluating the performance of 
professional personnel commissioned in 
the Public Health Service. 


Of 24 validity coefficients based on scoring
keys developed by selecting the best tetrads from
those which had the same empirically deter-
mined scoring weights in independent matched
samples, 41.7 per cent were .62 or higher. All
except one of the coefficients were significant at
the .01 level or below; this one was significant
at the .05 level (Table 2, “combined scoring”).

Only 12.5 per cent of 24 validity coefficients
based on item-analysis samples showed a sig-
nificant decrease at the .05 level or below in
cross validation (Table 2, comparison of “self”
and “cross” scoring).


2. The validity of the Forced Choice 
section was generally higher than that of 
the other Report sections studied. 


Out of 36 significant differences (.05 level or 
below) obtained in comparisons of the validity 
of the Experimental Report sections, 27 (75 per 
cent) involved higher validities on the Forced 
Choice tetrads, while only one involved a lower 
coefficient on this section (Table 4). 

The Forced Choice section contained a greater 
number of scored alternatives than the other 
sections of the Experimental Report; estimates 
of validity based on theoretically making each 
section infinitely long, however, seemed to indi- 
cate that the length of the Forced Choice sec- 
tion was not primarily responsible for its gen- 
erally higher validity (Table 9). 
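
The length argument can be illustrated with the classical test-theory relation between a section's validity and its length. The sketch below is an assumption about the kind of estimate meant here, not the authors' own computation; the function names and the input values are hypothetical and are not taken from Table 9.

```python
import math


def validity_at_length(r_xy: float, r_xx: float, n: float) -> float:
    """Validity expected if a section were lengthened n-fold.

    r_xy: observed validity of the section against the criterion.
    r_xx: reliability of the section at its present length.
    n:    lengthening factor (n = 2 doubles the section).
    """
    return r_xy * math.sqrt(n) / math.sqrt(1.0 + (n - 1.0) * r_xx)


def validity_at_infinite_length(r_xy: float, r_xx: float) -> float:
    """Limit of validity_at_length as n grows without bound."""
    return r_xy / math.sqrt(r_xx)


# Hypothetical inputs, chosen only for illustration.
long_section = validity_at_infinite_length(r_xy=0.60, r_xx=0.90)
short_section = validity_at_infinite_length(r_xy=0.45, r_xx=0.83)
print(round(long_section, 3), round(short_section, 3))
```

If the section with the higher observed validity still leads after both are projected to infinite length, length alone cannot account for its advantage.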

Out of 12 multiple correlation coefficients 
computed on the Experimental Report by the 
Wherry-Doolittle method of test selection, 11 
involved the Forced Choice section as a selected 
predictor; in nine of the 11, this section was the 
first selected predictor (Table 11). 

Comparisons of the validities of sections of
both the Experimental Report and the Officer's
Progress Report revealed 30 significant differ-
ences; 18 (60 per cent) involved higher validi-
ties on the Forced Choice tetrads while none
involved a lower coefficient on this section
(Table 13).


3. Of six occupational groups for
which separate scoring keys were de- 
veloped for the Experimental Report, 
the largest group, that of hospital physi- 
cians, was the one in which the highest 
validity coefficients generally occurred. 
The occupational groups, other than 
physicians, which were involved in the 
study were dentists, public health per- 
sonnel, research personnel, and nurses 
rated by two different criterion rater 
groups, physicians and nurses. 


Of 44 significant differences (.05 level or be- 
low) obtained in comparisons of validity coeffi- 
cients from one occupational group to another 
on sections of the Experimental Report, 33 
(75 per cent) involved higher coefficients in the 
medical group (Table 5). 
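
This section does not say which test was applied to the between-group differences. One standard option for correlations obtained in independent groups is Fisher's z transformation, sketched below under that assumption. The function name, coefficients, and group sizes are hypothetical and are not taken from Table 5.

```python
import math


def compare_independent_correlations(r1: float, n1: int, r2: float, n2: int):
    """Two-tailed test of H0: rho1 == rho2 for correlations from independent samples.

    Returns (z, p), where z is the standard normal deviate of the difference.
    """
    z1 = math.atanh(r1)  # Fisher z transform of the first coefficient
    z2 = math.atanh(r2)  # Fisher z transform of the second coefficient
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-tailed normal probability
    return z, p


# Hypothetical coefficients and group sizes.
z, p = compare_independent_correlations(r1=0.68, n1=200, r2=0.40, n2=60)
print(f"z = {z:.2f}, two-tailed p = {p:.4f}")
```

Because the standard error depends on both group sizes, a given difference in coefficients is more readily significant when the groups compared are large.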


Multiple correlations (R) for the Experimental
Report computed by the Wherry-Doolittle
method were, for the medical group, .68 and .63
against the Work Performance and the Per-
sonality criteria, respectively. Both coefficients
were significant at the .01 level or below, and
both represented a significant increase (.05 level
or below) in validity over that obtained on the
second-best single Report section. Multiple cor-
relations in the public health and the research
groups were also relatively high, ranging from
.57 to .67 on the two criteria (Table 11).


4. Validity coefficients were generally 
higher when Work Performance rather 
than Personality was used as the cri- 
terion. 


On all sections of the Experimental Report 
except the Forced Choice, higher validities were 
obtained with the Work Performance criterion 
than with the Personality criterion. Forced 
Choice validities were not consistently higher 
for either criterion (Table 3). 

Within each officer group, a higher multiple
correlation coefficient was obtained for the Ex-
perimental Report when Work Performance
was used as the criterion than when Personality
was used (Table 11).

Considering sections from both the Experi-
mental and the Progress Reports, higher
validities occurred in all but one instance when
Work Performance rather than Personality was
used as the criterion (Table 12).


5. Experimental Reports completed 
by a group of supervisors independent 
of and at a higher administrative level 
than those completing the Reports used 
in item analysis produced validities that 
compared favorably with the validities of 
the item-analysis Reports. 


Of 42 comparisons of validity from one level 
of supervisor to another, 21 involved higher 
validity coefficients on item-analysis Reports, 
and 21 involved higher validities on Reports 
completed by an independent group of super- 
visors. The median difference in validity coef- 
ficients in those comparisons in which item- 
analysis Reports yielded the higher validities 
was .07, and in those in which the independent 
Reports gave higher coefficients, .08 (Table 3).


6. Validity coefficients based on sec- 
tions of the Experimental Report did not 
show a consistent trend as a function 
of grade level. 


Of 122 possible comparisons of validity coef- 
ficients from one grade to another, only 16 (13.1 
per cent) yielded differences significant at the 
.05 level or below. Validity coefficients for 
separate grades were also compared with those 
based on all grades. The effect of combining 
grades appeared to be the masking of the higher 
validity obtained in certain specific grades; in 
only one instance was a combined grade validity 
higher than any of the coefficients for the 
separate grades (Table 6). 


7. The sections of the Experimental 
Efficiency Report exhibited satisfactory 
reliabilities. 


Spearman-Brown estimates of reliability for
three of the Report sections ranged from .78 to
.97. Median reliabilities (rx) were .95, .90, and
.83, respectively, for the Personal Qualifications,
the Forced Choice, and the Check List sections.
It was not possible to compute a split-half 
coefficient for the Job Proficiency section since 
it consisted of a single rating scale (Table 8). 

As a measure of rater agreement, scores on 
Reports completed by two groups of supervisors 
at different administrative levels were correlated. 
Over half of the correlations between the two 
sets of Reports were .55 or higher (Table 7). 
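
The reference to a split-half coefficient suggests that the Spearman-Brown estimates above were obtained by stepping up the correlation between two halves of each section. A minimal sketch of that step follows; the function name and the half-score correlation are hypothetical and are not taken from Table 8.

```python
def spearman_brown(r_half: float, k: float = 2.0) -> float:
    """Reliability of a section k times as long as the part that yielded r_half.

    With k = 2 this steps a split-half correlation up to full length.
    """
    return k * r_half / (1.0 + (k - 1.0) * r_half)


# Hypothetical correlation between scores on the two halves of a section.
half_correlation = 0.82
full_length_reliability = spearman_brown(half_correlation)
print(round(full_length_reliability, 2))
```

The formula needs at least two parts to correlate, which is why no such estimate is possible for a section made up of a single rating scale.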


8. The Rating Scales and Narrative
Comments sections of the Officer’s
Progress Report appeared to be about as 
adequate measures of performance as 
sections of the Experimental Report 
other than the Forced Choice. 


Median validity coefficients for the Progress 
Report compared favorably with those for sec- 
tions of the Experimental Report other than 
the Forced Choice. Data on the two Reports, 
however, were collected under different condi- 
tions so that comparative results are viewed as 
tentative (Table 12). 

The significance of the difference was tested 
in the validity coefficients obtained for the vari- 
ous sections of both Reports. Significantly higher 
(.05 level or below) validities occurred on each 
section of the Progress Report about as fre- 
quently as on the Experimental Report sections 
other than the Forced Choice (Table 13).


9. Multiple correlations computed on
the Experimental Report indicated that 
prediction was in some instances, but not 
in others, increased by the use of more 
than one Report section. 


All multiple correlations were significantly 
different from zero at the .01 level or below.
Five of the six correlations based on the Work 
Performance criterion represented a significant 
increase (.05 level or below) in validity over 
that obtained on the best single Report section 
for each officer group. Only one of those based 
on the Personality criterion, however, showed 
such a significant increase (Table 11). 
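
Why a second Report section sometimes adds to prediction and sometimes does not can be seen from the two-predictor formula for a multiple correlation. The sketch below is illustrative only: it does not reproduce the Wherry-Doolittle selection procedure, and the function name and all coefficients are hypothetical rather than taken from Table 11.

```python
import math


def multiple_r_two_predictors(r_y1: float, r_y2: float, r_12: float) -> float:
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: validities of the two predictors against the criterion.
    r_12:       correlation between the two predictors.
    """
    r_squared = (r_y1**2 + r_y2**2 - 2.0 * r_y1 * r_y2 * r_12) / (1.0 - r_12**2)
    return math.sqrt(r_squared)


# Hypothetical values: the best single section plus one additional section,
# first weakly and then strongly correlated with it.
best_single = 0.60
for r_12 in (0.30, 0.80):
    combined = multiple_r_two_predictors(best_single, 0.50, r_12)
    print(f"r_12 = {r_12:.2f}: R = {combined:.3f}, gain = {combined - best_single:.3f}")
```

When the added section overlaps heavily with the best single section, the gain in R shrinks toward zero, which is consistent with an increase that appears for some groups and criteria but not for others.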


10. The combination of sections of the 
Experimental Report which produced 
the maximum correlation with the cri- 
teria, as determined by the Wherry- 
Doolittle method, differed for each of the 
officer groups studied, but tended to in- 
clude the Forced Choice in combination 
with one of the rating scale sections, 
usually Personal Qualifications. 

Of 12 multiple correlations computed, six
involved the Forced Choice and Personal Quali-
fications sections as the only selected predictors,
and three involved these two sections in combi-
nation with a third section. In one multiple
correlation the Forced Choice section was se-
lected in combination with the Job Proficiency
scale, and in each of the two remaining multiple
correlations one of these two sections was the
only predictor selected (Table 11).


IMPLICATIONS OF THE FINDINGS


Evaluation of the performance of 
highly trained professional personnel 
poses a difficult measurement problem. 
The complexities of the work require- 
ments for such personnel make adequate, 
objective criteria of professional compe- 
tency difficult to obtain at the present 
time. Any criterion or criteria should 
presumably reflect such personal characteristics
as professional knowledge, judgment, technical
skill, originality, emotional adjustment, and
ability to administer programs in a professional
specialty. While the inadequacies of the
type of criterion employed in the pres- 
ent work are recognized, practical con- 
siderations necessitated the use of a con- 
ventional work-associates’ rating method. 

With the type of item analysis and
control of experimental variables used
in this study, it would appear that,
within the limitations imposed by a rat-
ing criterion, satisfactory validity and
reliability of performance evaluation
methods for professional health person-
nel can be obtained. Of particular inter-
est are the results obtained for the forced
choice tetrads which, under the condi-
tions of this study, generally produced
higher validity coefficients than other
methods of assessing or reporting effi-
ciency. Since rating-scale methods of
efficiency reporting have widespread
usage, it may also be of general interest
that these methods produced satisfactory
validity as measures of professional per-
formance.

The findings appear to be applicable 
to other organizations employing medi- 
cal, scientific, and other health personnel 
similar to those employed by the Public 
Health Service. With regard to the forced 
choice items, it may be recalled that the 
items used here, although scored by keys 
developed from item analysis of Experi- 
mental Efficiency Reports completed on 
Public Health Service personnel, were 
developed in another organization on an 
employee population quite different 
from that of the Public Health Service. 
From the evidence concerning validity 
of the tetrads in the variety of work 
activities in the Public Health Service 
(medical care, research, and public
health), it may be inferred that the item
content and the technique are such
as to be relevant in a number of different
employment situations.


APPENDIX


A. SAMPLES OF ITEMS FROM SECTIONS OF THE 
EXPERIMENTAL EFFICIENCY REPORT 


Section I. Forced Choice


Directions for completing: From each of the following sets of four words or phrases, mark the one word or phrase in each set which is “most descriptive” and the one which is “least descriptive” of the officer you are rating.


                                                | Most | Least
A. A go-getter who always does a good job
B. Cool under all circumstances
C. Doesn't listen to suggestions
D. Drives instead of leads

                                                | Most | Least
A. Cannot assume responsibility
B. Knows how and when to delegate authority
C. Offers suggestions
D. Too easily changes his ideas

                                                | Most | Least
A. Modest and reserved
B. Doesn’t have the drive or force he should
C. Antisocial
D. Respected by all fellow officers


Section II. Job Proficiency 


Directions for completing: From the Service functions listed below, select the one you consider to be the primary job of the officer you are evaluating. Rate the officer’s job proficiency in this function by marking a position on the ten-point scale.


1. Operation in a technical or specialized Public Health program
2. Care of patients or furnishing services to patients
3. Administration of a clinical or medical care program at any level
4. Directly performing research work


FOR RATING OFFICER    1    2    4    6    8    10
Number of Function ____


Section III. Personal Qualifications


Directions for completing: By marking a position on a ten-point scale, rate the officer on each
of the following personal qualifications.


The degree to which he is able to discriminate & evaluate facts to arrive at logical conclusions.

The degree to which he is able to carry out orders with consistency & firmness to achieve objectives.

The degree to which his appearance and behavior cause people to react favorably.

[Each item is followed on the form by a ten-point scale; the scale markings are not recoverable.]


Section IV. Check List


Directions for completing: From the following statements, determine whether or not each statement applies to the officer under consideration. If a statement does apply, mark space one (1); if it does not, mark space two (2).


                                                                        | Applies | Does not Apply
This officer has a broad and detailed knowledge of his profession
This officer’s usefulness is limited to a narrow field
This officer does an excellent job of planning and organizing his work


B. SAMPLES OF ITEMS FROM SECTIONS OF THE 
OFFICER’S PROGRESS REPORT 


Rating Scales 


Indicate rating by check mark    [...]    Good    Excellent


Judgment 

General professional knowledge 
Proficiency in assigned duties 
Tact 

General fitness for the service 


Questions Eliciting Narrative Comments


Are you satisfied to have this officer? Yes [ ] No [ ] Give reasons
Handicaps
What are your recommendations for this officer's improvement?


Remarks